Markov Chain Monte Carlo for Arrangement of Hyperplanes 
in Locality-Sensitive Hashing. 

Yui Noma* Makiko Konoshima*






Abstract 

Since Hamming distances can be calculated by bitwise 
computations, they can be calculated with less com- 
putational load than L2 distances. Similarity searches 
can therefore be performed faster in Hamming distance 
space. The elements of Hamming distance space are 
bit strings. On the other hand, an arrangement of
hyperplanes induces a transformation from feature
vectors into feature bit strings. This transformation
method is a type of locality-sensitive hashing that
has been attracting attention as a way of performing
approximate similarity searches at high speed.
Supervised learning of hyperplane arrangements allows us to
obtain a method that transforms higher-dimensional feature
vectors into feature bit strings reflecting the information
of the labels applied to them. In this paper, we
propose a supervised learning method for hyperplane 
arrangements in feature space that uses a Markov chain 
Monte Carlo (MCMC) method. 

We consider the probability density functions used 
during learning, and evaluate their performance. We 
also consider the sampling method for learning data 
pairs needed in learning, and we evaluate its perfor- 
mance. We confirm that the accuracy of this learning 
method when using a suitable probability density func- 
tion and sampling method is greater than the accuracy 
of existing learning methods. 

Keywords: Higher dimensional feature vector,
Locality-sensitive hashing, Arrangement of hyper- 
planes, Similarity search, Markov chain Monte Carlo, 
Low-temperature limit 

1 Introduction 

Unstructured data such as audio and images includes 
complex content. This makes it difficult to search for 
unstructured data directly. A common approach has 
therefore been to perform searches based on feature 
vectors extracted from unstructured data. To reflect 
the complexity of unstructured data, these feature vec- 
tors generally consist of higher-dimensional data with 
hundreds or even thousands of dimensions. 

There are a wide range of applications for high-speed 
similarity searching using higher-dimensional feature 
quantities extracted from unstructured data. Examples
include authentication of people by fingerprint
recognition, speech recognition in call centers, manage- 
ment of products and components based on CAD data, 
and detecting abnormal situations from surveillance 
video. For these applications, there are two things 
that are very important. One is a high-speed similarity 
search method. The other is a data structure that per- 
mits high-speed similarity searching, and a method for 
extracting feature quantities that reflect the properties 
of unstructured data. 

The high-speed similarity search method is described 
first. To perform a similarity search, the feature space 
should be a metric space. In most cases, the fea- 
ture space is treated as an L2 metric space. Many 
studies have devised index structures aimed at performing
similarity searches at high speed; see, for example,
references [1] and [2]. However,
in higher-dimensional space, due to the so-called "curse
of dimensionality", all distances between data
items are of similar size. Consequently, searches in
higher-dimensional data using these methods end up 
having processing times that are similar to those of 
searches performed without using a special index [3]. 

Hamming distances can be calculated by bitwise op- 
erations, which means that similarity searches are fast 
in Hamming metric space without using a specific in- 
dex structure. In a method called locality-sensitive 
hashing [3], the feature vectors are transformed into 
bit strings. For this transformation, methods that involve
the use of hyperplanes in feature space have been
intensively studied [5-11]. In these methods, multiple
hyperplanes are considered as a means of partitioning
the feature space. A bit string is assigned to each
partitioned region, determined from the orientations of the
hyperplanes. Each feature vector extracted from the data
is assigned the bit string of the region that contains it.
Similar feature vectors are included in neighboring regions,
so the bit strings allocated to these feature vectors are 
similar and are separated by small Hamming distances. 
In the following, we will use the term "hashing" to re- 
fer to the process of transforming higher-dimensional 
feature vectors into feature bit strings. 
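
As a small illustration of why Hamming distances are cheap to compute, the following sketch stores each bit string as a Python integer and counts the differing bits with a single XOR. The function name `hamming_distance` and the integer representation are assumptions made for this example, not part of the original text.

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two bit strings stored as Python integers."""
    # XOR leaves a 1 exactly where the two bit strings differ;
    # counting those 1s gives the Hamming distance.
    return bin(a ^ b).count("1")


x = 0b10110010
y = 0b10011010
d = hamming_distance(x, y)  # the two strings differ in two positions
```

In a production setting the same idea is usually applied to machine words with a hardware population-count instruction, which is what makes a linear scan over bit strings fast.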

Next, we consider a data structure that can be 
searched at high speed, and a method for extracting 
feature quantities reflecting the properties of unstruc- 
tured data. From the above discussion, we decided to 
use bit strings as feature quantities, since these are data 
structures that can be searched at high speed. When
data has been labeled, supervised learning can be used 
to extract feature bit strings that reflect the labeled 
information. In the following, feature quantities that 
reflect the labeled information are described as high- 
precision quantities. Also, a learning method that can 
extract high-precision feature quantities is described as 
a high-performance learning method. 

Studies aimed at increasing the precision of feature
quantities associated with the hyperplane hashing
method include references [5-11]. In
learning, the normal vectors of the hyperplanes are determined
by making the Hamming distances smaller
between data pairs with a common label and larger be- 
tween data pairs that do not have any common label. 
As the number of bits increases, the degree of freedom 
also increases so that greater precision becomes possi- 
ble. 

Based on this reasoning, we can draw the following 
conclusions regarding high-precision similarity search- 
ing of large quantities of unstructured data. High- 
speed similarity searches are achieved by using bit 
strings in Hamming metric space as feature quantities. 
High precision is achieved by using a large number of 
bits and performing supervised learning with labeled 
data as the training data. However, to reflect the com- 
plexity of unstructured data, it should be noted that 
a single item of unstructured data will not necessarily 
have just one label. 

In this paper, apart from the use of feature bit 
strings, no consideration is given to the processing time 
of the similarity search. Our main focus is on using su- 
pervised learning to improve the precision of feature 
bit strings. 

The method proposed in this paper performs super- 
vised learning using MCMC. The transformation of 
feature vectors into feature bit strings is a discontin- 
uous mapping. This makes it impossible to perform 
naive learning based on gradients. Another approach 
involves introducing a loss function so that the trans- 
formation method can be approximated by a continu- 
ous function. However, the only loss functions found 
so far are strongly dependent on the properties of the 
data set. In our proposed method, each normal vector 
is regarded as a particle on a unit sphere in feature
space, and a random walk is performed on this unit 
sphere. In the random walk, a discontinuous function 
can be treated as an evaluation function. We also con- 
sidered sampling methods for training data pairs and 
evaluation functions for use in learning. 

This paper is structured as follows. First, in section 2
we describe the existing learning methods. Then
in section 3 we describe our proposed method. We
consider the evaluation functions needed during learning,
and sampling methods for training data pairs. In
section 4 we perform experiments using various data
sets. At the same time, we also assess the evaluation
functions and the sampling methods. In section 5
we show that the proposed learning method performs
better than existing methods. Finally, in section 6 we
summarize our work and discuss the future prospects
of this approach.

2 Background and related work 

In this section, we describe the use of hyperplanes for
locality-sensitive hashing, which is the basis of the pro- 
posed technique. We then describe some related exist- 
ing techniques. 

2.1 Conventional locality-sensitive 
hashing with hyperplanes 

The hashing method using hyperplanes is described below.
A space V in which there are higher-dimensional
feature quantities is regarded as an N-dimensional vector
space. The configurations of multiple hyperplanes
in V are referred to as hyperplane arrangements.

Consider B hyperplanes passing through the origin
of V. A hyperplane passing through the origin is identified
by its normal vector. An N-dimensional feature
vector x is transformed into a bit string by taking its
dot product with each normal vector in turn and recording
a 1 if the dot product is positive, and a 0 otherwise.
Therefore, the length of the
bit string is equal to the number of hyperplanes B.
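
The transformation described above can be sketched in a few lines. This is a minimal illustration, assuming the B normal vectors are stacked as the rows of a NumPy array; the name `hash_to_bits` is hypothetical.

```python
import numpy as np


def hash_to_bits(x, normals):
    """Map a feature vector x of shape (N,) to B bits.

    normals has shape (B, N); row i is the normal vector of hyperplane i.
    Bit i is 1 if the dot product of x with normal i is positive, else 0.
    """
    return (normals @ x > 0).astype(np.uint8)


# Three hyperplanes through the origin of a two-dimensional feature space.
normals = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [-1.0, 1.0]])
x = np.array([2.0, 1.0])
bits = hash_to_bits(x, normals)  # dot products are 2, 1, -1
```

The resulting bit string has one bit per hyperplane, so its length is B as stated in the text.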

A hyperplane that does not pass through the origin 
can easily be constructed from a hyperplane that does. 
In reference , an experiment is performed where the 
hashing of hyperplanes that do not pass through the 
origin is learned by learning the hashing of hyperplanes
that do pass through the origin. When developing a
new learning method for hyperplanes, it is easier to
work with hyperplanes that pass through the origin. 
In the following discussion, therefore, all hyperplanes 
are assumed to pass through the origin. 

When labels have been applied to the data, it is 
sometimes the case that the angles or L2 distances do 
not exhibit a suitable degree of dissimilarity. In such 
cases, the hyperplanes can be determined by supervised
learning. In supervised learning, the hyperplanes are 
determined so that data pairs with a common label are 
separated by small Hamming distances, and data pairs 
that do not have any common label are separated by 
large Hamming distances. 

A single hyperplane can be specified by specifying its
normal vector. Since the length of the normal vector
specifying a hyperplane is immaterial, these lengths are
chosen so that the configuration space of normal vectors
corresponds to the $(N-1)$-dimensional hypersphere
$S^{N-1}$. When distinguishing between B hyperplanes,
the configuration space of the hyperplanes is $(S^{N-1})^B$.

In one hashing method, the B hyperplanes are set
randomly [5]. In the following, this is referred to as
the LSH method.
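
One common way to set hyperplanes randomly is to draw each normal vector from an isotropic Gaussian and normalize it, which yields points distributed uniformly on the unit sphere. The sketch below assumes this particular choice; the text only states that the hyperplanes are set randomly, and the function name `random_hyperplanes` is hypothetical.

```python
import numpy as np


def random_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits unit normal vectors in dim dimensions.

    Assumption: normals are uniform on the unit sphere, obtained by
    normalizing standard Gaussian vectors.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((num_bits, dim))
    return g / np.linalg.norm(g, axis=1, keepdims=True)


normals = random_hyperplanes(num_bits=8, dim=3)
```

Each row of `normals` is one point of the sphere, so the whole array is one point of the configuration space of B hyperplanes.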

Other references such as [5-10] describe hashing
methods that use hyperplanes. In particular, MLH [8]
and S-LSH [10] are described in subsections 2.2 and 2.3.

2.2 Minimal loss hashing

One existing learning method is Minimal Loss Hashing
(MLH) [8]. In MLH, the aim is to minimize an empirical
loss function on $(S^{N-1})^B$. However, the empirical
loss function is discontinuous, so it is not possible to 
use learning methods based on gradients. Therefore, 
the empirical loss is replaced by a differentiable upper 
bound function g, and gradients are used to minimize 
g instead. The point that gives the minimum value de- 
termines the coordinates of the B learned hyperplanes. 
Function g has several parameters that need to be ad- 
justed. Some of these parameters are dependent on 
the data pairs used for training. All the data pairs for 
learning are chosen at random. 

2.3 Locality-sensitive hashing with 
margin based feature selection 

In this subsection, we describe the concept of an existing
learning method called locality-sensitive hashing
with margin based feature selection (S-LSH) [10]. In
S-LSH, the normal vectors of hyperplanes are not
learned directly. Instead, $\tilde{B}$ hyperplanes ($\tilde{B} > B$) are
provided at random. A degree of importance is allocated to
each hyperplane, and these degrees of importance are
calculated by learning. The degrees of importance are
arranged in descending order, and the topmost B normal
vectors are selected. The distance calculations during
learning are performed using weighted Hamming
distances. Two types of data pairs are used during
learning. The learning data pairs are selected as follows.
A feature vector a is randomly selected from the
learning data. The first type of data pair consists of the
pair (a, b), where b is the feature vector with the smallest
weighted Hamming distance from a among those in the data set
that share a common label with a. The second type of data pair
consists of the pair (a, c), where c is the feature vector
with the smallest weighted Hamming distance from a among those
in the data set that do not share any common label with a.

S-LSH has been shown to have good learning performance
on many data sets [10]. It is particularly effective
for learning in cases where there are many labels and
only a few data items share each label.

3 The proposed method 
3.1 Motivation 

To perform a high-speed similarity search that accu- 
rately represents the latent similarities of unstructured 
data, we consider performing learning with a greater 
number of bits B. In learning, the normal vectors of 
the hyperplanes are determined by making the Ham- 
ming distances smaller between data pairs with a com- 
mon label and larger between data pairs that do not 
have any common label. In the following, we will refer
to a data pair with a common label as a "positive
pair", and to a data pair that does not have any common
label as a "negative pair".

The configuration space of a set of B hyperplanes is
$(S^{N-1})^B$. When there is an evaluation function U' on
$(S^{N-1})^B$ that has the following property, learning
a set of hyperplanes can be regarded as an optimization
problem that globally maximizes U'. The argument of
U' is the arrangement of multiple hyperplanes. Each
hyperplane divides the feature space V into two regions.
The value of U' increases as the number of positive
pairs whose feature vectors are in the same region
and the number of negative pairs whose feature vectors are in
different regions increase. In most cases, a function U' having
this property is thought to have multiple local maxima.
When B is large, the dimension of $(S^{N-1})^B$ increases
and it becomes harder to solve the optimization problem.
Since we are concerned here with hashing using
hyperplanes, the values of the function U' can also be discrete.
Therefore, it is unnatural to require continuity
of U' on the configuration space $(S^{N-1})^B$. Since U' is not
necessarily differentiable, it cannot be globally maximized
by methods that use the gradient of U'.

Instead of solving an optimization problem in
$(S^{N-1})^B$, we can consider a method where optimization
problems in $S^{N-1}$ are solved B times, and these
solutions are bundled together. That is, instead of
learning a set of B hyperplanes, the individual hyperplanes
are separately learned and the results are bundled
together. However, if we simply obtain B solutions to the
optimization problem in $S^{N-1}$, then the performance
is severely impaired for the following reason. Consider
an evaluation function U on $S^{N-1}$. Assume that the
points of $S^{N-1}$ where the value of U is larger correspond
to good hyperplanes. That is, we assume the following
property. When the feature space V is partitioned
into two regions by a single hyperplane, the value of
U increases as the number of positive pairs whose feature
vectors are in the same region and the number of negative pairs
whose feature vectors are in different regions increase.
A specific example of an evaluation function
is shown in subsection 3.3. We will assume that the
evaluation function U has a global maximum value at
$p_* \in S^{N-1}$. If all the hyperplanes are located at $p_*$, then
they are all degenerate. In this case, the feature space
V is only divided into two regions, and there are only
two types of representative bit string. Clearly it would
not be possible to capture the features of unstructured
data with these bit strings. For this reason, when we
consider bundling together the learning results of individual
hyperplanes, individual
hyperplanes should not necessarily be learned by finding the
point where the evaluation function U on $S^{N-1}$ is maximized.
Therefore, in the following we consider finding
B points where the evaluation function U has a local
maximum value in $S^{N-1}$ when performing learning
with individual hyperplanes. Multiple hyperplanes are
learned by bundling these together. Here, we must ensure
that the multiple hyperplanes are not oriented in
the same direction.

In the remainder of this section, we describe the proposed
method, which is a hyperplane normal vector
learning method (referred to below as M-LSH) based 
on the Markov Chain Monte Carlo method. At the 
same time, we consider and discuss a number of eval- 
uation functions (which have a strong influence on the 
performance of M-LSH learning), and data pair sub- 
sampling methods. 

3.2 Learning hyperplanes with the 
Markov chain Monte Carlo method 

In this subsection, we describe the proposed method.
Our proposed method is a supervised learning method 
for hyperplanes using MCMC. The aim of this learn- 
ing method is to probabilistically determine the point 
where the evaluation function U reaches its local max- 
imum value. The advantage of this method is that it 
does not require a differentiable evaluation function. 
Its disadvantage is that because it uses a Monte Carlo 
method, the point where the evaluation function is lo- 
cally maximized cannot be determined with perfect ac- 
curacy. However, from the properties of MCMC, the 
learned results are highly likely to be close to the point 
where the evaluation function is locally maximized. 
The higher the local maximum value and the sharper the
peak, the higher the probability that a particle is found
near such a point.

In the following, we will assume that the evaluation
function U is a function whose values are positive and
are bounded on $S^{N-1}$. Also, the details of the evaluation
function are assumed to depend on the training
data pairs. Specific examples of U are given in subsection
3.3.

Consider a particle on $S^{N-1}$ whose position equates
to the normal vector of a hyperplane. If $\tilde{U} := -U$ is
regarded as the potential energy, then to obtain the
local minimum value of $\tilde{U}$, we need to consider the
motion of dissipative particles. However, since U is
generally not differentiable, we cannot use optimization
methods based on gradients, that is, continuous
particle motions. Therefore we consider obtaining
a minimum solution by a random walk method.

We regard U as the probability density function on
$S^{N-1}$ (up to a normalization constant), and use
MCMC to evaluate the temporal evolution of particles. 
This method is our proposed M-LSH method. 

We use the Metropolis-Hastings algorithm for
MCMC [12]. For the proposal density function, we
use the normal distribution. In M-LSH, particles perform
random walks a fixed number of times. This is
the temporal evolution of the particles. We refer to this
temporal evolution as a single batch process. Since the
details of the evaluation function are determined by deciding
on the training data pairs, the handling of the
training data pairs may lead to incidental local maximum
values of the evaluation function. To prevent the
particles from becoming trapped at this sort of point,
batch processing is performed a number of times, and
the learning data pairs are replaced for each batch process.
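
To make one batch process concrete, the following is a minimal sketch of a random walk on the unit sphere for a single hyperplane. Several details are assumptions made for illustration and are not taken from the paper: the proposal is a Gaussian step followed by renormalization onto the sphere (by rotational symmetry this proposal is symmetric, so the acceptance ratio reduces to U(new)/U(old)), the evaluation function is a simple count-based score with U = exp(x/T), and the names `count_score` and `mh_batch` are hypothetical.

```python
import numpy as np


def count_score(normal, pos_pairs, neg_pairs):
    """Count positive pairs on the same side plus negative pairs on opposite sides."""
    score = 0
    for a, b in pos_pairs:
        score += int((a @ normal) * (b @ normal) > 0)
    for a, b in neg_pairs:
        score += int((a @ normal) * (b @ normal) < 0)
    return score


def mh_batch(normal, pos_pairs, neg_pairs, steps=100, step_size=0.1,
             temperature=1.0, rng=None):
    """Run one batch of Metropolis-type steps for a single unit normal vector."""
    rng = np.random.default_rng() if rng is None else rng
    x = count_score(normal, pos_pairs, neg_pairs)
    for _ in range(steps):
        # Gaussian step, then project back onto the unit sphere (assumed proposal).
        proposal = normal + step_size * rng.standard_normal(normal.shape)
        proposal /= np.linalg.norm(proposal)
        x_new = count_score(proposal, pos_pairs, neg_pairs)
        # Accept with probability min(1, U(new) / U(old)), where U = exp(x / T).
        accept_prob = np.exp(min(0.0, (x_new - x) / temperature))
        if rng.random() < accept_prob:
            normal, x = proposal, x_new
    return normal
```

In a full run, this batch would be repeated several times with freshly sampled training pairs, and independently for each of the B hyperplanes starting from random initial positions.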

By performing learning with multiple hyperplanes, 
we obtain multiple points where the evaluation function
is locally maximized. As described in subsection 3.1,
it is necessary to prevent the learning of points
where multiple hyperplanes produce the same local 
maximum value. In M-LSH, this issue is resolved by us- 
ing the following method. MCMC exhibits a property 
whereby particles tend to accumulate at places where 
the probability density function is locally maximized. 
The sharper the peak in the evaluation function close to 
the local maximum value, the more intense this trend 
becomes. In most cases, since U is multimodal, making 
its peaks sharper and randomly setting the initial posi- 
tions of the particles will cause the particles to collect 
at peaks close to their initial positions. Therefore, we 
can prevent the particles from all moving towards the 
same point. 

Many variants of M-LSH can be considered. These 
variants can be obtained by combining the evaluation 
function U with sampling methods for training data 
pairs that determine the details of the evaluation function.
Figure 1 lists these combinations. Each of these
items is described below in subsections 3.3 and 3.4.



M-LSH with

  Evaluation functions: COUNT, RATIO, COSINE, COSINE_RATIO

and

  Sampling methods: Randomhit-Randommiss, Randomhit-Nearmiss,
  Nearhit-Nearmiss, Farhit-Nearmiss, Randomhit-Nearnearmiss

Figure 1: M-LSH variants. The number of variants is
the number of combinations of evaluation functions and
sampling methods.



3.3 Evaluation function 

When learning is performed by M-LSH, the type of 
evaluation function must be determined. A number of 
possible evaluation function types are considered be- 
low. First we will introduce some nomenclature. PP 
denotes the set of all given positive pairs, and NP denotes
the set of all given negative pairs. The angles
subtended by the two feature vectors of a pair p relative
to the normal vector of a hyperplane are $\theta_1(p)$ and
$\theta_2(p)$, respectively. The following subsets are defined.

$$PP_+ := \{\, p \in PP \mid \cos(\theta_1(p)) \cos(\theta_2(p)) > 0 \,\} \qquad (1)$$
$$NP_- := \{\, p \in NP \mid \cos(\theta_1(p)) \cos(\theta_2(p)) < 0 \,\} \qquad (2)$$

In the following, the cardinality of a set A is denoted
by #A.

In the following formulas, we assume an evaluation
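function $U = \exp(x/T)$ that uses an enumerated value
$x$. Here, $T = 1$.

COUNT

$$x = \#PP_+ + \#NP_- \qquad (3)$$

RATIO

$$x = \frac{\#PP_+}{\#PP} + \frac{\#NP_-}{\#NP} \qquad (4)$$

COSINE

$$x = \sum_{p \in PP} \left| \cos(\theta_1(p)) + \cos(\theta_2(p)) \right| + \sum_{p \in NP} \left| \cos(\theta_1(p)) - \cos(\theta_2(p)) \right| \qquad (5)$$

COSINE_RATIO

$$x = \frac{1}{\#PP} \sum_{p \in PP} \left| \cos(\theta_1(p)) + \cos(\theta_2(p)) \right| + \frac{1}{\#NP} \sum_{p \in NP} \left| \cos(\theta_1(p)) - \cos(\theta_2(p)) \right| \qquad (6)$$

If $-x$ and $T$ are regarded as the particle's energy
and temperature respectively, then U can be regarded
as a Boltzmann weight. From this perspective, the
low-temperature limit is where T tends to zero, and the
high-temperature limit is where T tends to infinity.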
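
The four enumerated values can be sketched as follows for a single hyperplane. This is an illustration under stated assumptions: pairs are given as tuples of feature vectors, the COSINE_RATIO variant is read as COSINE with each sum divided by the number of pairs, and the function names `cosines`, `evaluation_value` and `evaluation_function` are hypothetical.

```python
import numpy as np


def cosines(pairs, normal):
    """Return cos(theta_1) and cos(theta_2) for every pair, as two arrays."""
    n = normal / np.linalg.norm(normal)
    a = np.array([p[0] for p in pairs], dtype=float)
    b = np.array([p[1] for p in pairs], dtype=float)
    c1 = a @ n / np.linalg.norm(a, axis=1)
    c2 = b @ n / np.linalg.norm(b, axis=1)
    return c1, c2


def evaluation_value(kind, pos_pairs, neg_pairs, normal):
    """Enumerated value x for one hyperplane, given positive and negative pairs."""
    p1, p2 = cosines(pos_pairs, normal)
    q1, q2 = cosines(neg_pairs, normal)
    if kind == "COUNT":
        return float(np.sum(p1 * p2 > 0) + np.sum(q1 * q2 < 0))
    if kind == "RATIO":
        return float(np.mean(p1 * p2 > 0) + np.mean(q1 * q2 < 0))
    if kind == "COSINE":
        return float(np.sum(np.abs(p1 + p2)) + np.sum(np.abs(q1 - q2)))
    if kind == "COSINE_RATIO":
        # Assumed reading: COSINE normalized by the number of pairs.
        return float(np.mean(np.abs(p1 + p2)) + np.mean(np.abs(q1 - q2)))
    raise ValueError(f"unknown evaluation function: {kind}")


def evaluation_function(kind, pos_pairs, neg_pairs, normal, temperature=1.0):
    """U = exp(x / T), interpreted as a Boltzmann weight with energy -x."""
    x = evaluation_value(kind, pos_pairs, neg_pairs, normal)
    return float(np.exp(x / temperature))
```

Lowering `temperature` sharpens the peaks of U, which is the mechanism the text relies on to keep particles near the local maximum closest to their starting point.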

3.4 Sampling method for training data 



The evaluation function defined in subsection 3.3 must
include both PP and NP. PP and NP can be determined
by considering all combinations of the training
data. However, the number of pairs obtained in this way
is about half
the square of the number of training data items. When
the cardinality of PP and NP is large, it can take a
long time to calculate the evaluation function.

Therefore in this subsection we consider a number
of different selection methods for PP and NP, and we
discuss their advantages and disadvantages. Here, we
will use the term "distance" to refer to L2 distance 
unless otherwise noted. 

We will also use the following nomenclature. $L$ is the
set of all the training data. $L_a$ represents the set of data
having a common label with an element $a \in L$, and $L_a^c$
represents $L \setminus L_a$. The distance between two elements
$a, b \in L$ is denoted by $\mathrm{dist}(a, b)$.

We will start by discussing the selection methods for
NP. We will consider the following sampling methods.

Randommiss

After $a \in L$ has been randomly selected, $b \in L_a^c$ is
randomly selected to form a negative pair $(a, b)$.

Nearmiss

After $a \in L$ has been randomly selected, a
negative pair $(a, b)$ is formed such that
$b := \arg\min_{c \in L_a^c} \mathrm{dist}(a, c)$.

Boundarymiss

After $a \in L$ has been randomly selected,
a negative pair $(a', b)$ is formed
such that $b := \arg\min_{c \in L_a^c} \mathrm{dist}(a, c)$ and
$a' := \arg\min_{c \in L_a \cap L_b^c} \mathrm{dist}(b, c)$.
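
The three negative-pair sampling methods can be sketched as below. The data layout is an assumption for this example: feature vectors are the rows of an array `X`, `labels[i]` is the set of labels of item i, and pairs are returned as index tuples. All function names are hypothetical.

```python
import numpy as np


def shares_label(labels, i, j):
    """True if items i and j have at least one label in common."""
    return len(labels[i] & labels[j]) > 0


def nearest(X, i, candidates):
    """Index in candidates whose feature vector is closest to X[i] in L2 distance."""
    d = np.linalg.norm(X[candidates] - X[i], axis=1)
    return candidates[int(np.argmin(d))]


def complement_of(X, labels, a):
    """Indices of items that share no label with a."""
    return [j for j in range(len(X)) if not shares_label(labels, a, j)]


def randommiss(X, labels, a, rng):
    """Negative pair (a, b) with b drawn uniformly from the items sharing no label with a."""
    return a, int(rng.choice(complement_of(X, labels, a)))


def nearmiss(X, labels, a):
    """Negative pair (a, b) with b the closest item sharing no label with a."""
    return a, nearest(X, a, complement_of(X, labels, a))


def boundarymiss(X, labels, a):
    """Negative pair (a', b): b as in Nearmiss, then a' is the item closest to b
    among those that share a label with a but none with b."""
    _, b = nearmiss(X, labels, a)
    cand = [j for j in range(len(X))
            if shares_label(labels, a, j) and not shares_label(labels, b, j)]
    return nearest(X, b, cand), b
```

The candidate set for a' is never empty because a itself shares a label with a and, by construction of b, shares none with b.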

Since Boundarymiss as defined above is a new sampling
method, we will describe it in more detail here.
Consider two elements $a, b \in L$ that do not have any
common label and for which $L_a \cap L_b \neq \emptyset$. Since the labels applied
to unstructured data can be of more than one type, this
sort of situation can occur frequently. Since the distributions
of $L_a$ and $L_b$ are overlapping, it is not possible
to obtain a hyperplane that separates them completely.
If we are allowed to bisect $L_a \cap L_b$ with a hyperplane,
then it may also be possible to separate the difference
sets $L_a \setminus L_b$ and $L_b \setminus L_a$. To learn a hyperplane that
bisects $L_a \cap L_b$, we can form a negative pair by selecting
one data item from each of $L_a \setminus L_b$ and $L_b \setminus L_a$.
Boundarymiss is one of the ways in which negative pairs of
this sort can be made. Furthermore, a pair chosen by Boundarymiss is
expected to lie close to the boundary between $L_a \setminus L_b$
and $L_b \setminus L_a$. Please refer to Fig. 2.

In Randommiss sampling, there is a high possibility
of selecting a pair comprising an element close to
the center of gravity of $L_a$ and an element close to
the center of gravity of $L_a^c$. This makes it easier to
learn a hyperplane that separates the center of gravity
of $L_a$ from the center of gravity of $L_a^c$. When $L_a^c$
is distributed over a broader region than $L_a$, it is
expected that the resulting hyperplane will deviate
from the boundary of $L_a$ and $L_a^c$. In particular, when
the number of labels applied to the training data is
large and the sets of each label have similar cardinality,
the distribution of $L_a^c$ tends to become broader
than that of $L_a$, so the tendency for the learned hyperplanes
to be separated from the boundary is thought
to become more pronounced as the number of labels
increases. Figure 2 shows some typical data pairs obtained
by Randommiss sampling, and the hyperplanes
learned from these pairs. The dotted circles in these
figures show the approximate regions over which these
sets are distributed.

In Nearmiss sampling, an element selected from $L_a^c$
lies close to the boundary of $L_a$ and $L_a^c$, so it is possible
to avoid the above drawback of the Randommiss sampling
method. However, when $L_a$ is distributed over
a wide region, there is a greater likelihood of $a \in L$
lying far from the boundary between $L_a$ and $L_a^c$.
Figure 2 shows some typical data pairs obtained by
Nearmiss sampling, and the hyperplanes learned from
these pairs.

When Boundarymiss sampling is performed, it can
compensate for the abovementioned drawbacks of the
Nearmiss method, but it is liable to choose data pairs
that are separated by smaller distances and is therefore
more susceptible to noise in the data.

We will now describe the selection method for PP.
For the reasons discussed below, it is better to consider
the handling of positive pairs in terms of applying corrections
to the discriminant planes used for the discrimination
of negative pairs. For example, if we learn with
only positive pairs, then because the hyperplanes should not
separate the feature vectors in each positive pair, the
hyperplanes may stay away from all the feature vectors.
In this case, all the bit strings of the training data
will be identical, making it impossible to separate the
feature vectors. We therefore consider positive pairs
to have the role of preventing $L_a$ from becoming separated
by the hyperplane. We will consider the following
sampling methods for positive pairs.

Randomhit

After $a \in L$ has been randomly selected, $b \in L_a$ is
randomly selected to form a positive pair $(a, b)$.

Nearhit

After $a \in L$ has been randomly selected, a
positive pair $(a, b)$ is formed such that
$b := \arg\min_{c \in L_a} \mathrm{dist}(a, c)$.

Farhit

After $a \in L$ has been randomly selected, a
positive pair $(a, b)$ is formed such that
$b := \arg\max_{c \in L_a} \mathrm{dist}(a, c)$.
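
The positive-pair sampling methods admit a similar sketch. As assumptions for this example, feature vectors are the rows of `X`, `labels[i]` is the set of labels of item i, and the item a itself is excluded from the candidates so that Nearhit does not trivially return a; the text does not state this exclusion explicitly. The function names are hypothetical.

```python
import numpy as np


def same_label_candidates(X, labels, a):
    """Indices of items other than a that share at least one label with a."""
    return [j for j in range(len(X)) if j != a and len(labels[a] & labels[j]) > 0]


def randomhit(X, labels, a, rng):
    """Positive pair (a, b) with b drawn uniformly from the candidates."""
    return a, int(rng.choice(same_label_candidates(X, labels, a)))


def nearhit(X, labels, a):
    """Positive pair (a, b) with b the closest candidate in L2 distance."""
    cand = same_label_candidates(X, labels, a)
    d = np.linalg.norm(X[cand] - X[a], axis=1)
    return a, cand[int(np.argmin(d))]


def farhit(X, labels, a):
    """Positive pair (a, b) with b the farthest candidate in L2 distance."""
    cand = same_label_candidates(X, labels, a)
    d = np.linalg.norm(X[cand] - X[a], axis=1)
    return a, cand[int(np.argmax(d))]
```

Because Farhit always picks the most distant same-label item, a single outlier can dominate its choices, which matches the sensitivity to outliers discussed in the text.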

We first consider a data set whose elements each have a single
label. In this case, Farhit sampling is considered to be
fragile against outliers, whereas Randomhit
sampling is thought to be robust against outliers. Nearhit sampling
is expected to have poor performance because it cannot
prevent data in $L_a$ other than the selected data
pair from being placed on different sides
of the hyperplane. It is thought that this performance
degradation is particularly severe when there are many
data items with the same label.

We next consider a data set whose elements have
an arbitrary number of labels. Here, we consider
the case where there are three positive pairs
$(a_1, b_1), (a_2, b_2), (a_3, b_3) \in PP$ such that
$a_1 \notin L_{a_2} \wedge b_1 \in L_{b_2} \wedge a_3, b_3 \in L_{b_2}$.
In particular, when $a_3$ and
$b_3$ are close to $b_1$ and $b_2$ respectively, we shall refer to
these data pairs as overlapping data pairs. This is summarized
in Fig. 3. When learning is performed in this
case, it becomes difficult to separate $L_{a_1}$ and $L_{a_2}$ from
$L_{b_2}$. As the number of sampled pairs increases, it is
thought that overlapping data pairs will become more
common. It is therefore expected that Farhit sampling
and Randomhit sampling will cause the performance to
become worse. In the case of Nearhit sampling, since
$a_3$ and $b_3$ are less likely to be close to $b_1$ and $b_2$, it
is thought that performance degradation will be less
likely to occur.

Figure 3: Schematic illustration of overlapping data
pairs

Based on this reasoning, there are as many possible
sampling methods as there are combinations of positive
pair and negative pair sampling methods. The
sampling methods that we actually evaluated and compared
are as follows.¹

• Randomhit-Randommiss 

• Randomhit-Nearmiss 

• Nearhit-Nearmiss 

• Farhit-Nearmiss 

• Randomhit-Boundarymiss 



4 Experiments and evaluation 

We performed experiments to measure the effects of 
the proposed method on a number of different data 
sets. In these experiments, supervised learning was 
performed on data that had already been labeled. The 
data labels used in these experiments are all known. 
When the search results are obtained, the Hamming 
distance between the query and data in the database 
is calculated, and the top search results are ordered 
in ascending order of distance. The acquisition rate is 
defined for this purpose as follows. 

$$\text{Acquisition rate} := \frac{\text{Number of data acquired by search}}{\text{Total number of data searched}} \qquad (7)$$

To evaluate the performance, we used the precision rate 
and recall rate as defined below. 



$$\text{Precision} := \frac{\text{Number of data items with a common label as the query for which search results were obtained}}{\text{Number of data items for which search results were obtained}} \qquad (8)$$

$$\text{Recall} := \frac{\text{Number of data items with a common label as the query for which search results were obtained}}{\text{Number of data items with a common label as the query among all relevant items}} \qquad (9)$$



A recall-precision curve shows the variation of recall 
rate and precision rate with changes in the acquisition 
rate. Better search performance is indicated by a recall 
rate and precision rate with values closer to 1. 
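
The definitions above can be turned into a short computation for a single query. This is a sketch under assumptions: bit strings are rows of a 0/1 array, labels are Python sets, the number of acquired items is the acquisition rate times the database size (rounded, at least 1), and the function name `precision_recall` is hypothetical.

```python
import numpy as np


def precision_recall(query_bits, query_labels, db_bits, db_labels, acquisition_rate):
    """Precision and recall of a Hamming-distance search at a given acquisition rate."""
    # Hamming distance from the query to every database item.
    dist = np.count_nonzero(db_bits != query_bits, axis=1)
    k = max(1, int(round(acquisition_rate * len(db_bits))))
    # Acquire the k items with the smallest Hamming distance.
    top = np.argsort(dist, kind="stable")[:k]
    relevant = np.array([len(query_labels & lab) > 0 for lab in db_labels])
    hits = int(relevant[top].sum())
    precision = hits / k
    recall = hits / int(relevant.sum()) if relevant.any() else 0.0
    return precision, recall
```

Sweeping `acquisition_rate` from a small value up to 1 and plotting the resulting pairs gives one recall-precision curve per query; averaging over queries is a common way to summarize them.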

In the experiments, by way of reference, we also cal- 
culated the precision rate and recall rate in similarity 
searches based on L2 distances using the original fea- 
ture vectors. 



¹ We believe that this choice is natural.
Figure 2: Selection methods for negative pairs. (From left to right) Randommiss, Nearmiss, Boundarymiss.
The red triangles, red circles and blue triangles correspond to element a, element b and element a' respectively.
The blue dotted line represents the hyperplane expected to be obtained by learning.



The remainder of this section is structured as follows.
First, we confirm the benefits of M-LSH on learning
with an artificial data set that we prepared. We
then evaluate the performance of the evaluation
functions and sampling methods considered in
subsections 3.3 and 3.4. For this performance evaluation,
we used actual data sets instead of our prepared data set.
Finally, we show how our proposed method differs from
existing learning methods.

4.1 Experiments with an artificial data 
set 

Using an artificial data set, we confirmed the effects
of M-LSH on learning. This artificial data set consists
of 300 data items sampled from a three-dimensional
standard normal distribution. With the axes labeled as
x, y and z, we classified the data items into two classes
according to whether the x component was positive or
nonpositive. As can be seen from the way in which the
data is labeled, we desire a hyperplane whose normal
vector n is n = (±1, 0, 0).
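
The construction of this data set, and the bit strings produced by an unlearned arrangement of hyperplanes, can be sketched as follows. The sketch assumes hyperplanes that pass through the origin and normal vectors drawn uniformly on the unit sphere for the LSH baseline; the random seed and the helper name `hash_bits` are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# 300 data items from a three-dimensional standard normal distribution.
X = rng.standard_normal((300, 3))
# Two classes: x component positive (1) or nonpositive (0).
labels = (X[:, 0] > 0).astype(np.uint8)

def hash_bits(points, normals):
    """One bit per hyperplane: the side of the hyperplane on which each point lies."""
    return (points @ normals.T > 0).astype(np.uint8)

# Unlearned LSH baseline: 1,024 normal vectors spread uniformly over the unit sphere.
n_bits = 1024
normals = rng.standard_normal((n_bits, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
lsh_bits = hash_bits(X, normals)

# The desired normal vector n = (1, 0, 0) reproduces the class labels exactly.
ideal_bits = hash_bits(X, np.array([[1.0, 0.0, 0.0]]))
print(lsh_bits.shape, np.array_equal(ideal_bits[:, 0], labels))
```

A learning method is successful on this data set to the extent that it moves the rows of `normals` toward (±1, 0, 0).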

Figure 4 shows the effects of LSH and M-LSH on
learning with a bit string length of 1,024. The
parameters of learning with M-LSH were as follows: number
of processing batches: 5, number of temporal evolution
steps in batch processing: 100, number of data
pairs used for learning in each batch process: 2,000,
evaluation function used during learning:
COUNT, sampling method: Randomhit-Randommiss,
with equal numbers of positive and negative pairs.

From Fig. 4 we can see the following. From the scatter
diagram and x component histogram of the normal
vectors obtained by LSH, we can see that the normal
vectors are uniformly distributed on a two-dimensional
sphere. From the scatter diagram and x component
histogram of the normal vectors obtained by M-LSH,
we can see that most of the normal vectors are
distributed in the vicinity of n = (±1, 0, 0). Figure 5
shows the precision rates and recall rates of LSH and
M-LSH. As expected from the distribution of normal
vectors, Fig. 5 shows that M-LSH has a positive effect
on learning.

Figure 5: Precision rate and recall rate curves of LSH
and M-LSH learning with artificial data set.

4.2 Experimental data 

In this subsection, we describe the experimental data
used in the performance evaluations performed in
subsections 4.3 and 4.4.

The experimental data was obtained from the fol- 
lowing sources. 

• MNIST 

Scanned images of handwritten numerals 0-9 [13].
Each digit is stored as a 28 x 28-pixel 8-bit 
grayscale image, and is labeled with the corre- 
sponding digit 0-9. We used the images them- 
selves as feature quantities. Therefore, the feature 
quantities had 784 dimensions. 

• Fingerprint images 

Fingerprint image data acquired using a fingerprint
image scanner. The feature quantities consisted
of the 4,096-dimensional Fourier spectra of
these fingerprint images. Since fingerprints are 
unique to each human, the data was labeled with 
the names of the corresponding individuals. In 
other words, each feature vector was given just 
one label. For details, see Ref. [10].



Figure 4: LSH and M-LSH learning results with artificial data set. Top left: Scatter diagram of normal vectors
learned by LSH; Bottom left: Histogram of x components. Top right: Scatter diagram of normal vectors learned
by M-LSH with COUNT; Bottom right: Histogram of x components.



• Speech features 

A set of 200-dimensional mel-frequency cepstral
coefficient (MFCC) feature quantities extracted
from a three-hour recording of a local government
assembly published on the Internet [33]. The
query was a spoken sound acquired separately. In
supervised learning of speech, the contents of the
speech are normally labeled with a text transcription.
Instead, we regarded the features with
the top 0.1% shortest Euclidean distances from
each query as being in the same class.
Each feature vector could have multiple labels.

• LabelMe 

LabelMe data using 512-dimensional Gist feature
quantities [15] extracted from image data published
in Ref. [16]. The labels were derived from
the dissimilarity matrix distributed with the data:
for each row, the same label was applied to the data
corresponding to the top 50 entries of that row.
Each feature vector can have multiple labels.

These data sets were selected with the following ap- 
plications in mind — MNIST: handwritten number 
recognition, Fingerprint images: biometric identifica- 
tion, Speech features: speech recognition, LabelMe: 
automatic image classification. 



The properties of the data used in the experiments are
summarized in Table 1. Here, we envisaged performing
searches on data recorded in a database, with the data
divided into three pairwise disjoint sets: a data set used
for learning, the searched data set, and a data set for
queries. The learning performance varied widely
depending on the number of labels in the data set and
on the cardinality of the sets of data having a common label.
However, since not every data set assigns a unique
label to each item, it is not possible to give a naive
definition of the number of labels. We therefore reasoned
as follows. For the data actually used for learning, the
average, over all data items, of the number of data items
having a common label with a given item is regarded as the
rough cardinality of the sets for each label. The rough
number of labels is then calculated by dividing the number
of data items actually used for learning by the rough
cardinality of the sets for each label. This information is
summarized in Table 1.
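
The two rough quantities defined above can be computed as in the following sketch. It assumes that each data item counts itself among the items sharing a label with it, which is consistent with the single-label MNIST case (60,000 items in sets of about 6,000 give about 10 labels) but is otherwise an assumption; the function name and the toy label sets are illustrative.

```python
def rough_label_statistics(label_sets):
    """Rough cardinality per label and rough number of labels.

    label_sets: one set of labels per training item (multi-label allowed).
    """
    n_items = len(label_sets)
    # For each item, count the items (itself included) that share at least one label.
    shared_counts = [
        sum(1 for other in label_sets if item & other)
        for item in label_sets
    ]
    rough_cardinality = sum(shared_counts) / n_items
    rough_number_of_labels = n_items / rough_cardinality
    return rough_cardinality, rough_number_of_labels

# Toy example: six items, two labels, three items per label.
toy_labels = [{"a"}, {"a"}, {"a"}, {"b"}, {"b"}, {"b"}]
cardinality, number_of_labels = rough_label_statistics(toy_labels)
print(cardinality, number_of_labels)
```

The double loop costs a number of set intersections quadratic in the number of items, so for data sets of the size used here the counts would in practice be estimated on a subsample or computed from an inverted index of labels.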

Since data generally contains noise, noise reduction
must be performed. Prior to the experiments, we
subjected all the data to the following processes, which
are widely used as noise reduction methods. The feature
quantities of the data are high-dimensional and,
depending on the data set, each component may be
expressed in different units. Unless the feature vectors
are made dimensionless, they cannot be used for the
calculation of distances or angles. We therefore subjected
the data to an affine transformation so that, for the
learning data, the average value of each component of the
feature vector became 0 and its standard deviation
became 1. We also performed a principal component
analysis of the learning data, found the subspace with a
cumulative contribution rate of over 80%, and mapped all
data to this subspace. After the above noise reduction
process, we performed learning and search tests. Since all
the methods were evaluated using data that had been
subjected to the same noise reduction process, this
process did not affect the comparison between the methods.
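
A minimal sketch of this preprocessing is given below. It assumes that the principal component analysis is carried out on the standardized learning data through an eigendecomposition of its covariance matrix and that the same transformation is then applied to the searched data and the queries; the function names, the guard for constant components, and the synthetic data are assumptions of the sketch.

```python
import numpy as np

def fit_preprocessing(train, contribution=0.8):
    """Fit the standardization and PCA projection on the learning data only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # guard for constant components (assumption of this sketch)
    standardized = (train - mean) / std
    # PCA by eigendecomposition of the covariance matrix, largest eigenvalue first.
    eigvals, eigvecs = np.linalg.eigh(np.cov(standardized, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # Smallest number of components whose cumulative contribution rate exceeds the threshold.
    k = int(np.searchsorted(cumulative, contribution, side="right")) + 1
    k = min(k, len(eigvals))
    return mean, std, eigvecs[:, :k]

def apply_preprocessing(data, mean, std, components):
    """Apply the fitted affine transformation and projection to any data set."""
    return ((data - mean) / std) @ components

# Synthetic learning data: three latent directions observed twice, in different units.
rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 3))
train = np.hstack([latent, latent + 0.01 * rng.standard_normal((500, 3))])
train *= np.array([1.0, 10.0, 100.0, 1.0, 10.0, 100.0])

mean, std, components = fit_preprocessing(train)
reduced = apply_preprocessing(train, mean, std, components)
print(components.shape, reduced.shape)
```

Fitting on the learning data alone and reusing `mean`, `std` and `components` for the other two sets keeps the three data sets disjoint in the sense described above.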

The parameters of the M-LSH experiments were as
follows. Standard deviation of the proposal density
distribution: 0.01, number of processing batches: 10,
number of temporal evolution steps in batch processing:
100, number of data pairs used for learning in each
batch process: 2,000 or 20,000.

4.3 Evaluation function performance 

Here, we evaluate the performance of the evaluation
functions cited in subsection 3.3. A natural
choice of sampling method is the simplest one,
Randomhit-Randommiss sampling. However, as shown
in subsection 4.4, the Randomhit-Randommiss
sampling method has poor performance.

Therefore, we instead used Randomhit-Nearmiss
sampling, which is regarded as the next simplest
sampling method after Randomhit-Randommiss sampling.

Figure 6 shows a graph of the precision rate and recall
rate for each data set at an acquisition rate of 0.1.
Since different data sets have different precision
rates and recall rates, the precision rates and recall
rates are scaled so that the values for searches using L2
distance are 1. The number of training data pairs used
for training M-LSH was 2,000. Since Fig. 6 shows the
scaled precision rate or the scaled recall rate, larger
values indicate better performance of the learning
method.

As Fig. 6 shows, the M-LSH performance obtained using
RATIO or COSINE_RATIO is slightly degraded but
more or less unchanged from that of LSH. Using
COUNT, M-LSH performs better for all data sets.
Using COSINE, M-LSH performs worse for all data sets.

The reason why M-LSH using RATIO or COSINE_RATIO
has almost the same performance as LSH
is thought to be as follows. The evaluation function
includes a parameter T that is analogous to temperature.
Since we used a fixed value of T = 1 in this evaluation,
the exponent of the evaluation function is confined to the
range [0,2] or [0,4]. In this range, a slight change of
the normal vector does not cause a large change in the
value of the evaluation function. Therefore, the normal
vector moves about more or less at random, so no
large difference from LSH is obtained. In other words,
for these evaluation functions T = 1 corresponds to a
high temperature. To increase the performance, we should
use a smaller T (i.e., a lower temperature), which expands
the range of the exponent and makes the peaks at the
maximum values sharper. However, at this limit, these
evaluation functions can be approximated by COUNT.
For this reason, at the low temperature limit, it is
thought that these two evaluation functions exhibit more
or less the same performance as M-LSH when using COUNT.
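
The role of T can be illustrated with a generic Metropolis acceptance rule. The sketch assumes a target density proportional to exp(f/T), where f stands for the quantity that the text calls the index of the evaluation function; this form is an assumption made for illustration and is not taken from the definitions of RATIO or COSINE_RATIO.

```python
import math

def acceptance_probability(delta, temperature):
    """Metropolis acceptance probability for a move that changes f by `delta`,
    assuming a target density proportional to exp(f / temperature)."""
    if delta >= 0:
        return 1.0  # uphill moves are always accepted
    return math.exp(delta / temperature)

# A slightly unfavorable move (f decreases by 0.1) at three temperatures.
for temperature in (1.0, 0.1, 0.01):
    print(temperature, acceptance_probability(-0.1, temperature))
```

At T = 1 a move that lowers f by 0.1 is accepted about nine times out of ten, so the chain wanders almost freely, whereas at T = 0.01 the same move is almost never accepted and only improvements survive, which is the behavior of a count-like criterion.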

COSINE is thought to perform much worse than LSH
for the following reason. In COSINE, the evaluation
function has a gentle gradient at all points in the
region where the normal vector is defined. Therefore,
the normal vectors tend to be oriented toward the point
that gives the global maximum value. To check whether
the normal vectors actually lie near the point giving the
maximum value, we calculated the absolute value of the
cosine between normal vectors. A larger absolute value
of the cosine means that the vectors are pointing in
similar directions. Figure 7 shows the absolute values
of the cosines between the normal vectors obtained by
32-bit M-LSH using COUNT or COSINE. In this figure, a
matrix whose elements are the absolute values of the
cosines between the 32 normal vectors is represented as a
grayscale image. The diagonal elements are all zero.
As Fig. 7 clearly shows, almost all of the normal vectors
obtained with M-LSH using COSINE are oriented in similar
directions. In the case of COSINE, it is thought that the
performance can be improved by taking a low temperature
limit, as was the case for RATIO and COSINE_RATIO.
However, at this limit, COSINE can be approximated by
COUNT. Furthermore, since COSINE requires more
processing time than COUNT, there is no need to bother
using COSINE.
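
The matrix visualized in Fig. 7 can be computed as in the following sketch. The two sets of 32 normal vectors are synthetic stand-ins (random directions, and small perturbations of a single direction) rather than learned results, and the dimension 149 is borrowed from the MNIST row of Table 1 purely for illustration.

```python
import numpy as np

def abs_cosine_matrix(normals):
    """Absolute cosines between all pairs of normal vectors, with a zero diagonal."""
    unit = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    cosines = np.abs(unit @ unit.T)
    np.fill_diagonal(cosines, 0.0)  # the diagonal is zero in the figure
    return cosines

def mean_off_diagonal(matrix):
    """Average of the off-diagonal elements of a square matrix with zero diagonal."""
    n = matrix.shape[0]
    return matrix.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
n_bits, dim = 32, 149

# Stand-in for well-spread normal vectors: independent random directions.
spread = rng.standard_normal((n_bits, dim))

# Stand-in for collapsed normal vectors: one direction plus small perturbations.
base = rng.standard_normal(dim)
base /= np.linalg.norm(base)
collapsed = base + 0.01 * rng.standard_normal((n_bits, dim))

spread_cos = abs_cosine_matrix(spread)
collapsed_cos = abs_cosine_matrix(collapsed)
print(mean_off_diagonal(spread_cos), mean_off_diagonal(collapsed_cos))
```

When the normal vectors collapse onto one direction, the off-diagonal entries approach 1 and the bits become nearly redundant, so a 32-bit string carries little more information than a single bit.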

Based on the above calculation results and discus- 
sion, it is thought that using COUNT as the evalua- 
tion function is more appropriate from the viewpoint 
of processing time and performance. 

4.4 Evaluation of sampling methods 

Here, we evaluate the performance of the sampling
methods discussed in subsection 3.4. Following the
discussion of subsection 4.3, we use the M-LSH method
with the COUNT evaluation function to evaluate the
performance of the sampling methods.

In the same way as when evaluating the performance
of the evaluation functions, we consider the scaled
precision rate and scaled recall rate when the acquisition
rate is 0.1. Figure 8 shows a graph of the scaled
precision rate and recall rate of each sampling method in
M-LSH using the COUNT evaluation function with 1,024
bits, where the number of sample data items used in each
batch process was 2,000. To evaluate the
dependence on the number of sample data items used
for training, we also calculated the precision rate and
recall rate with 20,000 sample data items, as shown in
Fig. 9.

Table 1: Experimental parameters

Parameter                                     MNIST    Fingerprint  Speech    LabelMe
Number of training data items                 60,000   9,906        192,875   11,000
Number of data items for searching            5,000    12,138       192,683   5,500
Number of data items for queries              5,000    19,932       1,815     5,500
Dimension before dimensionality reduction     784      4,096        200       512
Dimension after dimensionality reduction      149      276          30        20
Feature vectors have unique labels            Yes      Yes          No        No
Approximate number of labels                  10       1,300        2,000     300
Rough cardinality of the sets for each label  6,000    7            100       40

Figure 6: Performance of evaluation functions with different data types: Precision rates (left) and recall rates
(right). The horizontal axis of each panel is the data set (MNIST, Fingerprint, Speech, LabelMe).

From Figs. 8 and 9, it can be seen that the performance
of M-LSH using Randomhit-Randommiss sampling
is very poor for data sets other than MNIST. The
performance is worse than that of the LSH method.

The performance of M-LSH using Farhit-Nearmiss 
sampling was the best for MNIST. However, it had the 
worst performance for LabelMe. 

It can be seen that the performance of M-LSH
with the Nearhit-Nearmiss sampling method depends
strongly on the number of training data pairs.
For the speech features and LabelMe data sets, the
performance improves as the number of training data pairs
increases. This performance improvement is thought
to be due to the low probability of there being
overlapping data pairs. For MNIST, the performance
decreases as the number of training data pairs increases.
This effect is thought to occur in the following way. As
mentioned above, the role of positive pairs is to prevent
data sets with a common label from being split
by hyperplanes. It is therefore desirable that positive
pairs are widely distributed across each data set having a
common label. Nearhit sampling creates positive pairs
by choosing the closest feature vectors with a common
label, so a large number of positive pairs is needed
for the positive pairs to cover the distribution of a data
set having a common label. In particular,
MNIST requires more positive pairs than the other data
sets because a great many data items share the same
label. The role of negative pairs is to
separate data sets having different labels. Therefore, a
number of negative pairs roughly equal to the square
of the number of labels is sufficient. Since the positive
pairs and negative pairs were used in equal numbers
in these experiments, it seems that the effect of negative
pairs in separating data sets having different labels
outweighed the effect of positive pairs in preventing the
separation of data sets having a common label. We
think this is the reason why the performance decreases
as the number of training data pairs is increased.

The M-LSH method using Randomhit-Nearmiss
sampling and Randomhit-Boundarymiss sampling per- 
formed well for all data sets, regardless of the number 
of training data pairs. For MNIST and fingerprint im- 
ages, the performance improves as the number of train- 
ing data pairs is increased. However, for the speech 
features and LabelMe data sets, the performance was 
found to decrease as the number of training data pairs 
is increased. This is thought to be due to an increase in 
the number of overlapping data pairs. No large differ- 
ences could be observed between these two sampling 
methods. However, for the speech features and La- 
belMe data sets, the performance was very slightly bet- 
ter with M-LSH using Randomhit-Boundarymiss sam- 
pling. 

Based on these results, it is thought that the appro- 
priate choice of sampling method depends on the prop- 
erties of the data. Of the sampling methods we tried 
out in this study, it seems that the following choices 
are robust methods. 
• For data sets where each feature vector has a
unique label: Randomhit-Nearmiss sampling or
Randomhit-Boundarymiss sampling

• For data sets where each feature vector has multiple
labels and there are not many training data
pairs: Randomhit-Boundarymiss sampling

• For data sets where each feature vector has multiple
labels and there are very many training data
pairs: Nearhit-Nearmiss sampling

Figure 7: Cosines between normal vectors in M-LSH results learned for MNIST, with COUNT (left) and
COSINE (right) used as the evaluation function.

5 Comparison with existing 
learning methods 

In this section, we compare the performance of M- 
LSH with that of the existing learning methods LSH, 
MLH, and S-LSH. M-LSH uses the COUNT evaluation 
function and the Randomhit-Boundarymiss sampling 
method. The number of sample data pairs is 1,000 for 
both the positive pairs and negative pairs. 

Figure 10 shows the recall-precision curves for the
different data sets. Here, the number of bits is
1,024. From these results, it can be seen that M-LSH 
outperforms the existing learning methods for all the 
data sets apart from LabelMe. In LabelMe, there are 
small regions where the S-LSH curve rises above the 
precision and recall curves for M-LSH, but it can be 
said that better overall performance is obtained with 
M-LSH. 

From the above results, it is concluded that the pro- 
posed M-LSH learning method has good hashing per- 
formance. 

6 Summary and future works 

In this paper, we proposed a learning method for
hyperplanes using MCMC. We also considered evaluation
functions and sampling methods used in this learning
method, and we evaluated their performance. As a
result, we have confirmed that the proposed method
exceeds the performance of existing learning methods.



Finally, we mention the direction of future research.
When using the MCMC method for learning, the
final positions of the particles do not necessarily lie at
points that maximize the evaluation function. One way
in which this problem could be resolved involves
recording the particle loci and finding out which point
maximizes the evaluation function.
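
The idea of recording the best point visited can be sketched for a single hyperplane as follows. The score used here is a hypothetical count-style criterion (positive pairs left on the same side plus negative pairs separated) standing in for the COUNT evaluation function, whose exact definition is not restated in this section; the acceptance rule, the random pair sampling, and all function names are likewise assumptions of the sketch.

```python
import numpy as np

def count_score(normal, positive_pairs, negative_pairs):
    """Hypothetical count-style score for one hyperplane through the origin."""
    def same_side(pairs):
        return np.sign(pairs[:, 0, :] @ normal) == np.sign(pairs[:, 1, :] @ normal)
    kept = np.count_nonzero(same_side(positive_pairs))    # positive pairs not split
    split = np.count_nonzero(~same_side(negative_pairs))  # negative pairs separated
    return kept + split

def metropolis_with_best(normal, positive_pairs, negative_pairs,
                         steps=100, proposal_std=0.01, temperature=1.0, seed=0):
    """Metropolis walk on the unit sphere that also records the best point visited."""
    rng = np.random.default_rng(seed)
    current = normal / np.linalg.norm(normal)
    current_score = count_score(current, positive_pairs, negative_pairs)
    best, best_score = current.copy(), current_score
    for _ in range(steps):
        # Gaussian proposal followed by renormalization onto the unit sphere.
        candidate = current + proposal_std * rng.standard_normal(current.shape)
        candidate /= np.linalg.norm(candidate)
        candidate_score = count_score(candidate, positive_pairs, negative_pairs)
        # Accept with probability min(1, exp((new - old) / T)).
        log_u = np.log(1.0 - rng.random())
        if log_u <= (candidate_score - current_score) / temperature:
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current.copy(), current_score
    return current, current_score, best, best_score

# Artificial data as in subsection 4.1, with randomly sampled pairs.
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 3))
labels = X[:, 0] > 0
i, j = rng.integers(0, 300, size=(2, 4000))
same = labels[i] == labels[j]
positive_pairs = np.stack([X[i[same]], X[j[same]]], axis=1)
negative_pairs = np.stack([X[i[~same]], X[j[~same]]], axis=1)

start = rng.standard_normal(3)
final, final_score, best, best_score = metropolis_with_best(
    start, positive_pairs, negative_pairs)
print(final_score, best_score)
```

By construction the recorded best score is never lower than the score at the final position, so returning `best` instead of `final` costs only one extra vector of memory per hyperplane.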
Figure 10: Recall-Precision curve for MNIST (upper left), fingerprint (upper right), speech (lower left), and
LabelMe (lower right).